Abstract

Background: Genome-wide scans for regions that demonstrate deviating patterns of genetic variation have
become common approaches for finding genes targeted by selection. Several genomic patterns have been utilized
for this purpose, including deviations in haplotype homozygosity, frequency spectra and genetic differentiation 
between populations. 

Results: We describe a novel approach based on the Maximum Frequency of Private Haplotypes - MFPH - to 
search for signals of recent population-specific selection. The MFPH statistic is straightforward to compute for 
phased SNP- and sequence-data. Using both simulated and empirical data, we show that MFPH can be a powerful 
statistic to detect recent population-specific selection, that it performs at the same level as other commonly used 
summary statistics (e.g. Fst, iHS and XP-EHH), and that MFPH in some cases captures signals of selection that are
missed by other statistics. For instance, in the Maasai, MFPH reveals a strong signal of selection in a region that
contains the genes DOCK3, MAPKAPK3 and CISH, where other investigated statistics fail to pick up a clear signal. This
region has been suggested to affect height in many populations based on phenotype-genotype association studies. 
It has specifically been suggested to be targeted by selection in Pygmy groups, which are on the opposite end of 
the human height spectrum compared to the Maasai. 

Conclusions: From the analysis of both simulated and publicly available empirical data, we show that MFPH 
represents a summary statistic that can provide further insight concerning population-specific adaptation. 
Background

With the advent of new sequencing and SNP-genotyping 
technologies, searching for genomic regions affected by 
selection has become part of a standard population gen- 
etic analysis. Various types of selection cause deviations 
from the neutral expectation in patterns of genetic vari- 
ation around particular loci under selection (e.g. [1]). Sev- 
eral approaches for detecting these regions have been 
developed, including deviations in haplotype homozygos- 
ity, frequency spectra or genetic differentiation between 
populations. The basic principle often involves computing 
a summary statistic across the genome and then searching for
genomic regions that are outliers relative to the genome- 
wide distribution. Some approaches search for deviations 
in the allele frequency spectrum [2,3], others focus on ex- 
treme patterns of extended haplotype homozygosity [4-6], 
and some utilize signals of extraordinary population
differentiation (e.g. [7]). These methods have varying power
to detect signals of selection depending on how far back in 
time the selection occurred [8]. 

Many species and populations have been found to have 
adapted to local environments, such as climate conditions, 
food resources, and pathogen exposure. Evidence for
adaptation to soil conditions has been found in some
Arabidopsis lyrata populations [9], and adaptation to climate
conditions has been found in some Arabidopsis thaliana
populations [10]. Examples of adaptation to local condi- 
tions have also been found in animals, including pigmenta- 
tion variation in mice [11], wing patterns in butterflies, and 
adaptation to depth in the lake trout [12]. Population- 
specific selection or local adaptation is typically a recent 
phenomenon (at least on an evolutionary time-scale), and 
migration can easily obscure the signal in the genome over 
time, making signals of local adaptation particularly diffi- 
cult to detect. 

Humans have also been exposed to new environments 
and living conditions when colonizing new geographical 
areas and adopting various lifestyles. A handful of regions
in the human genome have been linked to population-
specific selection, including lactase persistence connected
to the LCT gene region that emerged independently in
northwestern Europeans [13] and pastoralist groups in
Africa [14,15]; resistance to infections connected to the
CCR5 gene [16]; copy number variation in the amylase
gene (AMY1) improving the capacity to digest starch-rich
diets [17]; genes affecting skin pigmentation in East Asians 
and Europeans [18]; resistance to malaria [19]; and adapta- 
tion to living at high altitudes [20,21]. Studies of local 
adaptation and the characterization of genome-local pat- 
terns of variation among humans may help us to under- 
stand the historical and cultural differences among human 
populations, and may also be informative of different 
metabolic reactions to medicines and nourishment [22]. 
Many of these examples of local adaptation have been de- 
tected by candidate gene approaches, but with the wealth 
of genomic data being accumulated, genome-wide scans 
for selected regions have become feasible. 

With strong selection acting on a gene, the favored vari- 
ant will increase rapidly in frequency in a short enough 
time so that recombination does not break down the cor- 
relation between SNP-variants around the selected variant. 
This phenomenon tends to decrease genetic diversity 
around the selected gene - a selective sweep [23] - and 
create high-frequency haplotypes. If the variant arose (or 
became frequent starting from a low level) in a particular 
population, population-specific selection could potentially 
be detected as private alleles at high frequency. Among 
the approaches used for detecting selection, only Fst,
XP-EHH [6] and XP-EHHST [24] explicitly focus on multiple
populations to assess local adaptation. In order to capture 
signals of local adaptation, we developed a new statistic: 
the Maximum Frequency of Private Haplotypes (MFPH) 
in subpopulations. MFPH is based on haplotypes, i.e., 
combinations of SNP-variants along a chromosome for a 
particular genomic region. We define private haplotypes 
as haplotypes that are found in the sample from a focal 
population, which are absent in the samples from other 
populations. In the analyses presented in this paper, we
require haplotypes to be completely unique to a sample to
qualify as private, but this criterion can easily be modified
to allow for a low frequency of the same haplotype in 
other samples (see Material and Methods). We investigate 
the properties of this statistic using simulations and pub- 
licly available data from humans, as well as comparing its 
performance to other statistics commonly used for detect- 
ing selection. 

Results 

First, we study the behavior of MFPH for simulated data
using a population divergence model (Figure 1), both
with population-specific selection and without selection
(the "neutral cases"). Second, we investigate HapMap III
SNP genotype data to validate that MFPH picks up signals
at some of the most well-characterized examples of
strong population-specific selection in the human genome.
Third, we discuss some regions in the HapMap III
data that are exclusively picked up by MFPH and not by
the other investigated statistics.

Figure 1 Scheme of the model used in the simulation study.
An ancestral population of 500 diploid individuals reaches mutation-drift
equilibrium during B generations; it then splits into three populations of
500 random-mating diploid individuals each. At time tm a mutation
occurs in population 3, which is adaptive in population 3 if G > 0. At time
ts, 15 individuals are sampled from each of the three populations.

Factors that influence MFPH 

To characterize the sensitivity of MFPH to confound- 
ing factors, we investigate the impact of various 
population- and genetic parameters on MFPH. We also 
compare the performance of MFPH to other statistics 
used to detect selection. 

The strength of selection (G) naturally affects the signal 
of selection and the difference between the selected and 
neutral cases increases with increasing G (Figure 2A). If 
selection is very strong (G > 250), MFPH starts to
decrease, probably because the selected variant quickly
fixes in the focal population and the beneficial variant
spreads via migration to neighboring populations
(Additional file 1: Table S2 shows that for very high
selection coefficients, the most frequent allele in population
3 - the population where selection is acting - is typically 
almost fixed and not unique to population 3). 

The mean of MFPH decreases with sampling time and
there is essentially no signal of selection when the sampling
occurred more than 200 generations after the emergence of
the selected variant (Figure 2B).

Figure 2 Effects of various properties on Fst measures, iHS, XP-EHH and MFPH. Influence of sampling time, migration rate, selection
strength and recombination rate in simulations with selection (G > 0, red line) and simulations without selection (G = 0, blue line). Mean values
were calculated on 5 kb-windows containing the variant at site 50,001 and averaged over 100 simulations. Unless variable along the x-axis, the
default values for the parameters were: G = 150, ρ = 0.001, m = 1, θ = 0.001, tm = 100, ts = 50, N = 500. A to D: mean MFPH, E to H: mean Fst for
SNPs, I to L: mean Fst for haplotypes, M to P: mean iHS (absolute value), Q to T: mean XP-EHH.

Since MFPH is based on private haplotypes, migration 
will affect MFPH. As shown in Figure 2C, at migration
rates above 10 migrants per generation, the difference be- 
tween cases with and without selection becomes small, 
and when the migration rate reaches 20, discriminating 
between the neutral and selected cases becomes difficult 
(Figure 2C and Additional file 1: Table S2). 

Another factor that impacts MFPH is the recombination 
rate. Simulations with selection revealed a decrease in
MFPH with increasing recombination rate (Figure 2D).
However, MFPH was much greater in simulations with
selection compared to simulations without selection, even
for relatively large recombination rates (Figure 2D and
Figure 3). For high recombination rates (low levels of LD,
Figure 3A and B), MFPH drops rapidly towards the value 
under neutrality as the distance from the selected site in- 
creases. In contrast, if the recombination rate is low (high 
levels of LD, Figure 3D), MFPH remains above the level of 
the neutral case over a much longer region. 

The choice of window-size also impacts MFPH. For ex- 
ample, as the window-size increases, the magnitude of the 
peak at the selected site decreases while the width of the 
peak increases (Figure 4A, D and G). The decrease in 
MFPH at the selected site likely arises because many
distinct low-frequency haplotypes dominate the haplotype
window if the window-size is large and because there is more
than one haplotype under positive selection (increasing the
recombination rate has a similar effect). This phenomenon
is also evident for the Fst measures, in particular Fst based 
on haplotypes (Figure 4). However, even a ten-fold differ- 
ence in window-size had a minor impact on the qualitative 
behavior of MFPH in our simulations (Additional file 1: 
Figure S3). 
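
The window-size effect described above can be illustrated with a minimal Python sketch of a sliding-window scan. This is not the authors' implementation: the function names (window_mfph, scan), the representation of phased chromosomes as 0/1 strings, and the toy data are assumptions made for illustration, and the strict definition of privacy (a haplotype must be absent from all other samples) is used.

```python
from collections import Counter


def window_mfph(haps_by_pop, focal, start, size):
    """MFPH of the focal population for one window of `size` SNPs.

    haps_by_pop maps a population name to a list of phased chromosomes,
    each written as a string of alleles (an assumed toy representation).
    """
    focal_counts = Counter(h[start:start + size] for h in haps_by_pop[focal])
    # Every haplotype seen in any non-focal sample in this window.
    others = {
        h[start:start + size]
        for pop, haps in haps_by_pop.items()
        if pop != focal
        for h in haps
    }
    n_k = len(haps_by_pop[focal])
    private = [c for hap, c in focal_counts.items() if hap not in others]
    return max(private) / n_k if private else 0.0


def scan(haps_by_pop, focal, size, step):
    """Return (window start, MFPH) pairs along the chromosome."""
    n_snps = len(haps_by_pop[focal][0])
    return [
        (s, window_mfph(haps_by_pop, focal, s, size))
        for s in range(0, n_snps - size + 1, step)
    ]


# Hypothetical toy data: four chromosomes per population, six SNPs.
toy = {
    "P3": ["110011", "110010", "110000", "001100"],
    "P1": ["000011", "001100", "000000", "001111"],
}
print(scan(toy, "P3", size=2, step=2))
print(scan(toy, "P3", size=6, step=1))
```

In this toy example the 2-SNP windows give a peak of 0.75 at the first window, while a single 6-SNP window gives only 0.25, because the longer window splits the common private haplotype into several distinct low-frequency haplotypes.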

Comparing MFPH to other statistics used for detecting 
selection 

Various summary statistics commonly used to search for 
signals of selection were also computed based on the same 
data to compare with MFPH, including iHS [5], XP-EHH 
[6] and Fst [25,26]. We compute two different versions of
Fst: Fst based on the haplotypes defined by a specific
window (which we refer to as "Fst haplotype") and the
average value of Fst across SNPs in a specific window ("Fst
SNP") (see Additional file 1). Other commonly used
summary statistics for detecting signals of selection include
Tajima's D [2] and Fay & Wu's H [3]. These statistics were,
however, only included for completeness since they are
not based on haplotypes or specifically designed to detect
population-specific selection.

Overall, the factors that influence MFPH have similar
effects on Fst, iHS and XP-EHH (Figure 2, see Additional
file 1: Figure S4 for the behavior of Tajima's D and Fay &
Wu's H). Sampling time has a relatively small effect on
XP-EHH and Fst based on SNPs, and the signal of selection
can be detected for long time-periods after the emergence
of the selected variant (Figure 2). MFPH, iHS and
Fst based on haplotypes capture the selection signal well
if the selected variant emerged recently (less than 100
generations ago), but fail to detect selection on variants that
emerged earlier. Migration has a strong effect on the ability
of Fst measures to pick up the selection signal, similar
to the behavior of MFPH. In contrast, iHS and XP-EHH
can distinguish a selection signal even if the migration rate
is substantial. Compared to MFPH and iHS, both Fst
measures and XP-EHH are better at distinguishing cases
with population-specific selection from neutral cases if the
selection coefficient is large. The somewhat poorer
performance of MFPH and iHS in this case may be due to
the loss in power when the advantageous variant is close
to fixation [5,6]. MFPH, iHS and XP-EHH are more sensitive
to weak selection (G < 100) while Fst based on haplotypes
starts to pick up a selection signal only when G
reaches 150. All investigated statistics show decreasing 
power to detect selection with increasing recombination 
rate. However, even for the greatest recombination rates 
we investigate here (up to 20 times greater than the muta- 
tion rate), the statistics were able to distinguish the cases 
with selection from the cases without selection (except 
perhaps for Fst based on haplotypes). 

HapMap III data 

We computed MFPH for the following HapMap III 
populations: Maasai from Kinyawa in Kenya (MKK), 
CEPH Europeans from Utah of north-western European 
descent (CEU) and Japanese from Tokyo together with 
Han Chinese from Beijing (JPT + CHB). These popula- 
tions were selected to minimize the occurrence of recent 
migration between populations and because particular 
population-specific selection events have been described 
for these populations. 

Based on these three populations, the greatest genome- 
wide value of MFPH is located around the LCT gene on 
chromosome 2 in Maasai and north-western Europeans
(Figure 5 and Additional file 1: Figure S5-S7), which is
consistent with previous results revealing selection for
lactase persistence in this region and in these populations
[13-15]. Large MFPH values for the East Asian
population were found on chromosomes 2 and 4, specifically
overlapping the EDAR gene region on chromosome 2
and the ADH1B gene region on chromosome 4,
also consistent with previous results [6,27,28]. The distinct
MFPH signals around the LCT gene region for the
Maasai and the north-western Europeans as well as the
signal around the ADH1B and EDAR genes in the East
Asian population show that MFPH has power to detect
population-specific selection events (Figure 5 and
Additional file 1: Figure S5-S7, see also Additional file 1:
Figures S8-S12 for a comparison of MFPH to XP-EHH,
iHS and Fst haplotype in these regions).

The variance of MFPH was greatest for the East Asian 
population (Additional file 1: Figure S5-S7), followed by the 
north-western European population and the East African 
population. This can be a consequence of the demographic 
history of these populations with well documented bottle- 
necks affecting the Asian and the European populations 
[29-31]. The choice of reference populations and sample 
size also affects MFPH. For instance, computing MFPH
across the genome for all ten HapMap III populations
causes the MFPH signal around the LCT gene region
to disappear in the north-western European population
(Additional file 1: Figure S13). This effect of pooling is not
surprising considering that several of these populations
have similar genetic background, and haplotypes are likely
to be shared across these populations (e.g. between the
north-western European population (CEU) and the British
population (GRB)), which will impact statistics that rely on
population differentiation (like MFPH, XP-EHH and Fst).
This also illustrates that conducting scans for local adaptation
on different sets of populations can, in fact, provide
information about the nature of the selective event.

Figure 4 Effect of the window-size on the three window-size dependent summary statistics. The figure shows scans along the same data
using three different window-sizes: 1, 5, and 10 kb, with step lengths 500 bp, 2.5 kb and 5 kb respectively. Parameters: G = 150, ρ = 0.001, m = 1, θ = 0.001,
tm = 100, ts = 50, N = 500. MFPH, Fst SNP and Fst haplotype are averaged over 100 simulations. A to C: MFPH, D to F: Fst haplotype, G to I: Fst SNP.

Finally we investigated the top MFPH signals after
excluding chromosome 2 (on which both EDAR and LCT
are located) in the three populations. For the European
sample, eleven windows were in the extreme top tail
(8.86*10^-6 tail) and had an MFPH value of 16/34 (1 window)
or 15/34 (10 windows) (the exact ratios arise because
MFPH takes a discrete set of values, with n + 1
possible values for a sample of size n). For the African
sample, 75 windows (corresponding to the 6.04*10^-5 tail)
had values of 11/34 (1 window) or 10/34 (74 windows),
and 42 windows (corresponding to the 3.38*10^-5 tail) had
a value of 25/34 for the Asian sample. These candidate
windows were often adjacent to each other in each popu- 
lation and clustered into two regions for the European 
and the African sample and one region for the Asian sam- 
ple (Table 1, see also Additional file 1: Figure S5). As we 
were specifically searching for windows where MFPH 
showed a strong signal while there was little signal in the 
other investigated statistics we focused on the two African 
Maasai candidate windows on chromosome 3 for which 
there was little evidence of selection based on iHS, XP- 
EHH and the two Fst measures. One of these regions is
located on chromosome 3 between 50.6 and 51.3 Mb
and contains, inter alia, the genes CISH (cytokine-induced
STAT inhibitor), MAPKAPK3 (MAP kinase-
activated protein kinase 3, Ser/Thr kinase) and DOCK3
(dedicator of cytokinesis 3), all potentially affecting
height [32] (Additional file 1: Figure S14A). The other
candidate region is between 101 and 101.4 Mb (on
chromosome 3) containing the genes IMPG2 (interphotoreceptor
matrix proteoglycan-2), SENP7 (SUMO1/
sentrin specific peptidase 7) and PCNP (PEST proteolytic
signal containing nuclear protein; Additional file 1:
Figure S14B).

In the CEU population, a region located around 74 Mb
on chromosome 10 shows a peak in MFPH (Additional
file 1: Figure S5 and S15). This region contains the genes
(among others) MCU (mitochondrial calcium uniporter),
MRPS16 (human mitochondrial ribosomal protein S16)
and PLA2G12B (shown to be important for HDL cholesterol
levels in mice [33]).

Discussion 

In this study we present a new haplotype-based statistic 
for detecting population specific positive-selection, which 
is intuitive and easy to compute. We compare the behav- 
ior of MFPH to similar and commonly used summary sta- 
tistics for detecting selection, including Fst [25], XP-EHH 
([6], see also [24] for an additional example of a similar 
statistic) and iHS [5]. These summary statistics have often 
been used in scans for regions targeted by selection relying 
on an outlier approach. The conceptual idea of the outlier
approach is that if there are regions targeted by selection,
but these are relatively rare, then these regions are
likely to show up as outliers compared to the genome-wide
distribution. These outlier-regions are therefore potential
targets for selection, although it is difficult to assess the
significance of a set of identified outliers as true targets for
selection (see e.g. [34-36]).
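
As a rough illustration of the outlier approach, the following Python sketch picks the windows in the top tail of an empirical genome-wide distribution. The function name top_tail, the tie-handling rule and the toy values are assumptions for illustration only; the sketch does not assess statistical significance, which, as noted above, is difficult for outlier sets.

```python
import math


def top_tail(values, fraction):
    """Return (threshold, indices) for the top `fraction` of windows.

    Because a statistic such as MFPH takes a discrete set of values,
    ties are common; all windows tied with the threshold are kept, so
    the returned set can be larger than fraction * len(values).
    """
    n_top = max(1, math.ceil(fraction * len(values)))
    threshold = sorted(values, reverse=True)[n_top - 1]
    outliers = [i for i, v in enumerate(values) if v >= threshold]
    return threshold, outliers


# Hypothetical per-window values of a summary statistic.
genome_wide = [0.1, 0.5, 0.2, 0.5, 0.3, 0.0, 0.4, 0.1]
print(top_tail(genome_wide, 0.125))
```

Keeping ties is one possible design choice; it explains why a nominal tail fraction can correspond to a whole group of windows that share the same discrete value.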

Using both simulations and empirical data we con- 
clude that MFPH has similar power for detecting selection 
compared to many other summary statistics (Figure 2 and 
Additional file 1: Figure S4). We show that MFPH detects 
a clear signal of selection in some of the most well-known 
examples of selection in the human genome: the LCT 
gene-region in Maasai and north-western Europeans and 
EDAR and ADH1B in East Asians (Figure 5). Using
genome-wide correlations we find that MFPH correlates
the strongest with haplotype based Fst followed by either
XP-EHH or iHS depending on the population considered
(Figure 6). This population dependency illustrates that
MFPH is an additional source of information compared to
iHS, XP-EHH and Fst.

An MFPH scan of the HapMap III data revealed five top
regions, two in the Maasai, two in the European sample
and one region in the Asian sample (Table 1). The two regions
in the Maasai (both on chromosome 3) were not captured
by any of the other statistics and one of these regions,
the region around position 51 Mb (Additional file 1: Figure
S14A), coincides with a region that has been implicated as a
target for selection on stature in Pygmy groups [32]. While
the average stature within Pygmy populations is exceptionally
short compared to other African populations [37], the
Maasai are among the tallest [38]. Interestingly, the Pygmy
populations and Maasai show distinctly different genotypes
in this region (Additional file 1: Figure S16), suggesting that
different haplotypes in the region have been targeted by
selection for stature in the Maasai and the Pygmy populations.
There are three genes associated with variation in
height in this region [32]: DOCK3, a guanine nucleotide
exchange factor that has been associated with height variation
in Europeans [39]; the CISH gene, which has been shown to
inhibit growth factors [40]; and MAPKAPK3, involved in
growth, development and stress [41]. This region has a high
level of LD (and hence a small genetic distance (in cM) for
LD-based genetic maps such as the HapMap genetic maps).
Indeed, if windows based on cM are used to compute
MFPH, this particular region would not be a top candidate
in MKK (see Additional file 1: Figure S7). However, since
one of the most characteristic signals of selective sweeps is
high LD, using windows from LD-based recombination
maps will likely result in substantial loss of power for any
haplotype based statistic targeting selective sweeps. Indeed,
the LCT region also has high LD (and small genetic distances
in cM for HapMap recombination maps) due to recurrent
selective sweeps (at least two selective sweeps
occurred in the LCT region [13-15]). For MFPH one can
choose to control for diversity (using SNP-windows) or
control for recombination rate (using cM-windows; see
Additional file 1 for correlations between MFPH and genetic
distances).

Table 1 The regions with the highest MFPH value across the HapMap III data after excluding chromosome 2

Population   Region                         Genes
CEU          Chr10:74,416,452-75,102,866    MCU, OIT3, PLA2G12B, NUDT13, ECD, DNAJC9, MRPS16, JTOS
CEU          Chr6:145,190,620-145,554,235
MKK          Chr3:100,979,041-101,375,515   IMPG2, SENP7, PCNP
MKK          Chr3:50,617,979-51,354,540     HEMK1, CISH, MAPKAPK3, DOCK3
JPT + CHB    Chr15:62,232,223-62,888,060    TLN2, VPS13C, C2CD4A, C2CD4B

The region situated around 101 Mb on chromosome 3 
(Additional file 1: Figure S14B) includes the gene IMPG2 
which codes for an interphotoreceptor matrix proteogly- 
can. This gene has been pointed out as important in dia- 
betic retinopathy [42,43] and thus possibly also involved 
in other types of retinopathies such as solar retinopathy. 
Although little is known about the molecular mechanisms 
of solar retinopathies, individuals with greater exposure to 
sunlight show greater frequency of solar retinopathies [44] 
which could potentially have led to adaptation targeting 
the IMPG2 gene among the Maasai as an effect of expos- 
ure to sunlight and UV radiation (at least compared to the 
comparative European and Asian populations). While the 
51 Mb-region (chromosome 3) has been implicated as a 
target for selection before, neither of these regions on 
chromosome 3 would have been found and highlighted as 
candidate regions for local adaptation in the Maasai based 
on iHS, XP-EHH or the two Est measures. 

Similarly, in CEU, the MFPH peak around 74 Mb on
chromosome 10 (Additional file 1: Figure S15) may indicate
that this region has been under recent positive
selection while there is little indication of this based on
the other statistics. This region contains at least two
interesting genes, P4HA and PLA2G12B. PLA2G12B
codes for a phospholipase initially shown to be lacking
activity [45] but also to be involved in HDL cholesterol
level in mouse [33]. PLA2G12B is a member of the
PLA2 group of genes that are globally involved in many
mechanisms like lipid digestion, inflammation and degradation
of bacterial phospholipids (cited in [46]). P4HA is
responsible for the synthesis of collagen and is interestingly
expressed in macrophages and thus probably involved
in the repair of injured or inflamed tissues [47]. Thus,
though more information is required, there is some evidence
that alleles of these two genes could have been
targets of selection among Europeans in response to
pathogen exposure.

[Figure caption fragment: window-size of 300 SNPs with a step length of 1 SNP.]

To assess more closely the additional information contained
in MFPH relative to XP-EHH and haplotype based Fst 
in the presence of selection, we used simulations with 
population-specific selection. We used two standard de- 
viations from the hypothetical genome-wide mean (here 
represented by simulations where the selection coeffi- 
cient is set to zero) for each summary statistic as the indi- 
cator of selection. This set-up allowed us to quantify how 
often MFPH detects (or fails to detect) a signal of selection 
that was detected (or not detected) by XP-EHH or Fst 
Haplotype (Figure 7). There were many cases when MFPH
found a (true) signal of selection which was missed by the
alternative statistics (XP-EHH or Fst) implying that 
MFPH provides additional information, and there were 
also many cases when either XP-EHH or Fst detected se- 
lection while it was missed by the other statistics (Figure 7). 
In the simulations with very strong selection, MFPH de- 
tected a subset of cases compared to either XP-EHH or 
Fst. Interestingly, there seemed to be considerably less 
overlap in signal between MFPH and either Fst or XP- 
EHH than between XP-EHH and Fst suggesting that com- 
bining either XP-EHH or Fst with MFPH may capture a 
larger set of the selection-cases compared to the combin- 
ation of XP-EHH and Fst (Figure 7). 
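
The two-standard-deviation comparison described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the function names (detected, overlap), the toy values, and the choice of a one-sided upper cutoff (mean plus two standard deviations of the neutral simulations) are assumptions.

```python
from statistics import mean, stdev


def detected(selected_values, neutral_values, n_sd=2.0):
    """Flag simulations whose statistic exceeds the neutral mean by n_sd SDs.

    Assumption: a one-sided upper cutoff is used, since larger values of
    the statistics considered here indicate selection.
    """
    cutoff = mean(neutral_values) + n_sd * stdev(neutral_values)
    return [v > cutoff for v in selected_values]


def overlap(calls_a, calls_b):
    """Cross-tabulate detection calls of two statistics on the same simulations."""
    pairs = list(zip(calls_a, calls_b))
    return {
        "both": sum(a and b for a, b in pairs),
        "only_a": sum(a and not b for a, b in pairs),
        "only_b": sum(b and not a for a, b in pairs),
        "neither": sum(not a and not b for a, b in pairs),
    }


# Hypothetical values: neutral simulations (G = 0) and simulations with selection.
neutral_mfph = [0.0, 0.1, 0.2, 0.1, 0.1]
selected_mfph = [0.5, 0.2, 0.3]
print(detected(selected_mfph, neutral_mfph))
```

Applying overlap to the calls of two statistics on the same set of simulations gives the kind of counts that an overlap diagram summarizes.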

MFPH depends on the choice of populations being contrasted.
Since it is based on population-specificity, comparing
recently diverged populations or admixed populations
will decrease the power of MFPH, but it is easy to adjust
the computation of MFPH to allow some level of haplotype 
sharing among populations. Contrasting a focal subpopula- 
tion to a few selected populations in the HapMap III or to 
all HapMap III populations resulted in different outcomes. 
While some signals remained regardless of the choice of
populations, other signals were lost if a larger set of populations
was used (Additional file 1: Figure S13), which can
be understood by considering the relationship of the populations.
This type of information can also be used to investigate
(for instance) the age of the selective event as well as
to pinpoint which particular populations have been affected
by selection.

MFPH also depends on the choice of window-size 
(Figure 4). In theory, the strength of selection, the time 
since the selection started and the recombination rate 
should govern the expected width of the region around 
a selected site that retains a signal of deviation from the 
genome-wide average. In other words, the size of a deviant 
region should contain information about the nature of the 
selection event. For example, since MFPH is straight- 
forward to compute for various choices of window-sizes, 
the effect of window-size can be integrated into the 
statistical framework (somewhat similar to the wavelet-
transform analyses in [48]) and help determine properties
of detected selection signals.

MFPH is well suited to the rapidly increasing amount of
sequence data. For example, low-frequency variants
(e.g. rare variants or variants caused by sequencing errors)
will typically not influence the most frequent haplotype
and therefore not MFPH either. For the same reason,
MFPH is not likely to be efficient at detecting background
selection or negative selection. MFPH is further only marginally
affected by phasing errors (Additional file 1) as
phasing errors typically create low frequency haplotypes
[49]. Compared to sequence based statistics such as Tajima's
D and Fay & Wu's H, MFPH also shares the feature
with other haplotype based statistics of being less affected
by SNP ascertainment biases [50], making it an ideal statistic
for SNP data or low-coverage sequence data that fails to
capture all variants. Finally, differences in variance of
MFPH across populations suggest that demographic events 
influence MFPH to some degree and the effect of demog- 
raphy on MFPH should be assessed for investigations of 
specific populations (e.g. [51]). 

Conclusions 

Our simulation studies of population specific selection 
under various model parameters as well as comparisons to 
other summary statistics show that MFPH is a powerful
tool for detecting recent, relatively strong population- 
specific selection. We demonstrate that MFPH has similar 
power to Fst and XP-EHH (two similar and widely used 
statistics). Importantly, MFPH may capture events that are 
missed by other statistics. For instance, MFPH alone impli- 
cated selection in a gene-region in Maasai that has been 
pointed out as a candidate region for stature in Pygmy 
groups. Thus, MFPH constitutes a valuable additional sum- 
mary statistic for investigating local adaptation, possibly in 
a demography-informed approach utilizing, for instance,
Approximate Bayesian Computation [52,53]. MFPH is well
suited for analyzing large genome wide data since it is quick
and easy to compute for phased data. Moreover, since
MFPH is defined in terms of haplotypes, it is expected to
be robust to effects of ascertainment bias, and because it
focuses on the maximum frequency of haplotypes, it should
also be robust to phasing and sequencing errors that create
rare haplotypes.

Figure 7 Overlap of summary statistics when they are two standard deviations away from the mean in the simulated data. Default
parameter values were G = 150, ρ = 0.001, m = 1, θ = 0.001, tm = 100, ts = 50, N = 500. Mean and standard deviation were calculated on the
corresponding neutral simulations (same parameters but with G = 0). A: MFPH and XP-EHH, B: MFPH and Fst haplotype, C: XP-EHH and
Fst haplotype.

Methods 

Definition of MFPH 

We focus on haplotypes, i.e., combinations of SNP-variants 
along a chromosome for a particular genome region. We de- 
fine private haplotypes as haplotypes that are found in the 
sample from one particular population, but absent in the 
samples from other populations. Note that "private" is sam- 
ple based and that a private haplotype can potentially be 
present in more than one population. Sample size affects the 
probability of sampling alleles, and in the case of unequal 
sample sizes, the rarefaction approach can be used to obtain 
comparable statistics [54,55], or, alternatively, down-sampling 
can be employed to obtain comparable sample sizes. 

Formally, let n_i denote the number of sampled sequences 
from population i (i = 1, ..., S). Focus on a locus l in a 
sequence (a predefined window of either a specific number of 
consecutive SNPs or a specified length of a region in either 
base pairs or centimorgans). Let h(i, j, l) denote the 
haplotype of sequence j in the sample from subpopulation i at 
locus l. A haplotype x is defined as private to population k 
at locus l iff: 

0 ≤ sample frequency of x after excluding the sample 
from subpopulation k ≤ ε < sample frequency of x in 
subpopulation k 

or 

\[
0 \le \frac{\sum_{i=1}^{S} \sum_{j=1}^{n_i} I(h(i,j,l) = x) - \sum_{j=1}^{n_k} I(h(k,j,l) = x)}{\sum_{i=1}^{S} n_i - n_k} \le \varepsilon < \frac{\sum_{j=1}^{n_k} I(h(k,j,l) = x)}{n_k}
\]

where I is an indicator variable so that I(True) = 1 and 
I(False) = 0. Setting ε = 0 implies that a haplotype is private 
to population k if and only if it is absent in all samples 
except the sample from population k, while letting ε > 0 allows 
for a less strict definition of privacy. Let H(k, l) denote the 
set of haplotypes private to population k at locus l, then 

\[
\mathrm{MFPH}(k,l) = \max_{x \in H(k,l)} \frac{\sum_{j=1}^{n_k} I(h(k,j,l) = x)}{n_k}
\]

If H(k, l) is empty, MFPH(k, l) is defined to be 0. 
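
The definition above can be illustrated with a minimal Python sketch for a single window. The function name `mfph`, the representation of haplotypes as strings and the example data are assumptions made for illustration only; this is not the implementation used in the study.

```python
from collections import Counter


def mfph(samples, k, eps=0.0):
    """Maximum frequency of private haplotypes for population k in one window.

    samples: list of populations, each a list of haplotypes (hashable, e.g. str)
             observed in the same window (hypothetical input layout).
    k:       index of the focal population.
    eps:     privacy threshold; eps = 0 requires absence from all other samples.
    """
    n_k = len(samples[k])
    if n_k == 0:
        return 0.0
    counts_k = Counter(samples[k])
    # Pool every sample except the one from population k.
    others = [h for i, s in enumerate(samples) if i != k for h in s]
    counts_other = Counter(others)
    n_other = len(others)

    best = 0.0  # MFPH is defined to be 0 when no haplotype is private
    for x, count in counts_k.items():
        freq_k = count / n_k
        freq_other = counts_other[x] / n_other if n_other else 0.0
        if freq_other <= eps < freq_k:  # x is private to population k
            best = max(best, freq_k)
    return best


# Toy example with three populations and four chromosomes each.
example = [
    ["ACG", "ACG", "ACG", "TCG"],
    ["TCG", "TCG", "TTG", "TTG"],
    ["TCG", "TTG", "TTG", "TCA"],
]
values = [mfph(example, k) for k in range(3)]
```

In the toy example, "ACG" is found only in the first population, so it is private there under ε = 0, whereas no haplotype in the second population is absent from the other two samples.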



Model description and simulations 

In order to investigate the behavior of MFPH, we simu- 
lated genomic data where a specific locus is under positive 
selection using forward simulations implemented in the 
software SFS_code [56]. We model three populations of 
equal size (N = 500 diploid individuals) that split from an 
ancestral population (N = 500 diploid individuals) at time 
zero. The population size has been chosen arbitrarily in 
order to have reasonable computation times. Each simu- 
lated individual is represented by two chromosomes of 
length (L) 100,000 bp. Individuals can migrate between 
populations at rate m which represents the number of in- 
dividuals coming from the two other populations into one 
particular population each generation. Sites mutate with a 
population-scaled per-site rate of θ (θ = 4Nμ, where μ is the 
per-site per-generation mutation rate). Mutations occur 
under a pseudo-infinite site model (see the SFS_code 
documentation [56] for more details). Recombination events 
occur with a population-scaled rate of ρ (ρ = 4Nr), where r 
is the probability of cross-over between two adjacent sites 
per generation. The population-size-scaled mutation rate 
per site was set to θ = 0.001 (implying a scaled mutation 
rate for the fragment of θL = 100) and the recombination rate 
ρ was set to values between 0.001 and 0.02 (the scaled 
recombination rate for the fragment, ρL, was set between 100 
and 2000). Assuming a mutation rate of 1.25 × 10^-8 per site 
per generation [57], our simulated θL = 100 corresponds to 
a 4 Mb DNA fragment in a population of 500 (4 × 500 × 
1.25 × 10^-8 × 4 × 10^6 = 100) or, alternatively, a 200 kb DNA 
fragment in a population of 10,000 (4 × 10^4 × 1.25 × 10^-8 × 
2 × 10^5 = 100). Since we are only interested in the variable 
sites, this simplification allowed faster simulations while 
producing realistic SNP and haplotype data. See Additional 
file 1: Table S1 for parameter settings of the model. 
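
The scaling argument above can be checked numerically. The following short Python sketch recomputes the fragment-level scaled rates from the values given in the text; the helper name `scaled_rate` and the variable names are hypothetical.

```python
def scaled_rate(n_diploid, per_site_rate, length_bp):
    """Population-scaled rate for a fragment: 4 * N * rate per site * length."""
    return 4 * n_diploid * per_site_rate * length_bp


MU = 1.25e-8          # per-site per-generation mutation rate assumed in the text
L_SIMULATED = 100_000  # simulated fragment length in bp

# Fragment-level scaled rates used in the simulations (per-site value times L).
theta_L = 0.001 * L_SIMULATED       # scaled mutation rate, approximately 100
rho_L_min = 0.001 * L_SIMULATED     # lowest scaled recombination rate, approximately 100
rho_L_max = 0.02 * L_SIMULATED      # highest scaled recombination rate, approximately 2000

# Two real-world settings that give the same fragment-level theta.
theta_4mb_n500 = scaled_rate(500, MU, 4_000_000)     # 4 Mb, N = 500
theta_200kb_n10k = scaled_rate(10_000, MU, 200_000)  # 200 kb, N = 10,000
```

Both settings give a fragment-level θ of approximately 100 (up to floating-point rounding), which is why the simulated 100,000 bp fragment can stand in for either of them.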

In order to be able to compare MFPH across inde- 
pendent simulations of the same model, we computed 
MFPH on bp-windows on simulated data. 

In the simulations, a mutation occurs in population 3 at 
the center of the chromosome (position 50,001 bp) at a 
fixed time t_m, the "mutation time", given in number of 
generations after the population split. Individuals in 
population 3 carrying the derived variant at this site have a 
selective advantage with a population-scaled selection coefficient 
G. Individuals in populations 1 and 2 carrying this variant 
have no selective advantage. Samples are drawn 
after an additional t_s generations following t_m (t_m + t_s 
generations after the population split). We refer to t_s as the 
"sampling time"; see Figure 1 for an outline of the model. The 
ancestral population is allowed to evolve for 5,000 
generations (this is a "burn-in time" to omit any effects of the 
starting conditions, see SFS_code manual) prior to the 
population split. Conditional on site 50,001 being 
polymorphic in the pool of the three populations at the time 
when the samples are drawn, we generated 100 simulations 
for each set of model parameters, and averaged the results 
across simulations. For each set of parameter values we also 
performed 100 comparative simulations without selection 
(G = 0) (the "neutral" cases), where we still conditioned on 
site 50,001 being polymorphic in order to generate 
simulated data where the neutral and selected cases were as 
similar as possible. This conditioning likely had a minor 
influence on our results: the frequency of the deterministic 
mutation in populations 1 and 2 when it was under selection 
in population 3 (G > 0) was similar to the frequency in 
population 3 when there was no selection (G = 0, Additional 
file 1: Figure S1). In contrast, the frequency of the selected 
variant in population 3 was markedly increased when G > 0 
(Additional file 1: Figure S1). However, to further investigate 
whether this conditioning had a large influence on the neu- 
tral distribution, we performed 10,000 neutral simulations 
without this conditioning with the (relevant) default param- 
eter values. We compared the distribution of MFPH in a 
window overlapping the position of the inserted mutation 
when this mutation was present to when it was absent 
(Additional file 1: Figure S2). The distributions are similar 
and we conclude that conditioning on a mutation in the 
neutral simulations has little or no influence on MFPH. 

Computing MFPH for the HapMap III data 

We computed MFPH for the HapMap III phased data [58]. 
MFPH was calculated for sets of three populations after 
down-sampling the number of chromosomes to equal the 
sample size of the population with the smallest sample size. 
We computed MFPH (and the comparative statistics) for 
windows with a fixed number of SNPs, a fixed physical 
size, or a fixed size in cM based on the 
HapMap II genetic map [59] (calculated on the combined 
CEU, YRI and JPT + CHB populations), with a step size of 
one SNP between windows. 
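
The down-sampling and windowing steps described above could be sketched as follows. This is an illustrative outline only: the function names, the representation of phased chromosomes as equal-length strings, and the choice of sampling chromosomes at random without replacement are assumptions, since the text does not specify how chromosomes were selected.

```python
import random


def downsample(samples, seed=0):
    """Reduce every population to the size of the smallest sample.

    samples: list of populations, each a list of phased chromosomes
             (hypothetical layout: one string of alleles per chromosome).
    Chromosomes are drawn at random without replacement (an assumption).
    """
    rng = random.Random(seed)
    n_min = min(len(s) for s in samples)
    return [rng.sample(s, n_min) for s in samples]


def snp_windows(chromosomes, width, step=1):
    """Yield (start_index, haplotypes) for windows of `width` SNPs.

    With step=1 consecutive windows are shifted by one SNP.
    """
    n_snps = len(chromosomes[0])
    for start in range(0, n_snps - width + 1, step):
        yield start, [c[start:start + width] for c in chromosomes]


# Toy data: three populations with unequal sample sizes, four SNPs each.
populations = [
    ["0101", "0110", "1100"],
    ["0101", "0011"],
    ["1111", "0000", "1010", "0101"],
]
equalized = downsample(populations, seed=1)
first_pop_windows = list(snp_windows(equalized[0], width=2))
```

Each window yields one list of haplotypes per population, which is the input a per-window statistic such as MFPH requires.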

To study the effect of how windows are defined, we com- 
puted the pairwise correlations between MFPH with a fixed 
number of base pairs (bp-windows) and MFPH with win- 
dows of a fixed genetic distance (cM-windows). For ease of 
comparison, sizes of bp-windows and cM-windows were 
chosen according to the mean base pair-size and cM-size 
of a 200 SNP-window on chromosome 2. Spearman and 
Pearson correlations were computed between MFPH based 
on SNP-windows, bp-windows and cM-windows, and we 
found that MFPH values from the three window types 
are highly correlated (between 0.60 and 0.88 depending on 
the comparison). All three types of windows have respective 
advantages and disadvantages, and we present the results 
for windows with a fixed number of SNPs in the main text 
and based on bp-windows and cM-windows in the 
supplementary material. 
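
As an illustration of the comparison between window definitions, the sketch below computes Pearson and Spearman correlations for two lists of MFPH values in plain Python. It assumes that the values have already been paired, for example by matching windows centred on the same SNP; the pairing rule and the example numbers are hypothetical and are not taken from the study.

```python
from statistics import mean


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ranks(values):
    """Ranks starting at 1, with tied values receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical paired MFPH values for SNP-windows and bp-windows.
mfph_snp = [0.10, 0.20, 0.40, 0.80]
mfph_bp = [0.15, 0.10, 0.50, 0.90]
r_pearson = pearson(mfph_snp, mfph_bp)
r_spearman = spearman(mfph_snp, mfph_bp)
```

Reporting both coefficients is useful because the Spearman correlation depends only on the ranking of windows, which is what matters when candidate regions are selected from the tail of the distribution.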

Additional file 



Additional file 1: Contains supplementary methods, results, figures 
and tables. 